Like many organizations over the past year, we’ve been experimenting with generative AI. We’ve built proof-of-concept and prototype applications to assist in researching company documents and scaling operations, and we’ve experimented with “smart agents” that can help exercise rights on consumers’ behalf and otherwise represent them. Generative AI is clearly a powerful technology with the potential to be a force for consumers, yet wrangling that power in a way that is consistent, transparent and accountable is difficult.
To dive into one example, one of the prototypes we’ve been developing is a conversational research assistant that knows everything CR knows and provides trusted, transparent advice to consumers conducting product research. Though we’re still pre-MVP, we’re learning a lot about what responsible development means to us and how to put it into practice. So far, we have four main takeaways:
1. Off-the-shelf RAG doesn’t work (for us)
Though we started out using a fairly standard Retrieval-Augmented Generation (RAG) architecture in our prototypes, we’ve surprised ourselves with how much we’ve needed to customize even simple proofs of concept.
We are solving a very specific problem: helping our users perform simple product research tasks while consulting CR’s vast trove of articles, ratings and reviews. What’s cool about this particular problem space and data set is that the answers we’re seeking from the LLM are pretty objective. For example, if the user asks “What’s the best lightweight snowblower?,” CR has an obvious answer – the lightest snowblower with the highest ratings in our lab tests. Using the RAG techniques available in many frameworks, we don’t reliably get back the correct answer to this question. However, by building something more custom, we’ve seen answers improve dramatically.
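To make that concrete, here is a minimal sketch of the kind of customization this implies – not CR’s actual pipeline. It assumes a hypothetical structured ratings data set; the point is that an objective question like “lightest, highest-rated snowblower” can be answered by a deterministic query over that data, with the LLM only summarizing the retrieved facts rather than guessing from loosely similar text chunks.

```python
# Illustrative sketch: deterministic retrieval over a hypothetical ratings table,
# instead of relying on vector similarity alone to surface the "best" product.
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    category: str
    weight_lbs: float
    overall_score: int  # lab-test rating; higher is better

def best_lightweight(products: list[Product], category: str) -> Product:
    """Filter to the category, then prefer the highest score, breaking ties by lower weight."""
    candidates = [p for p in products if p.category == category]
    return max(candidates, key=lambda p: (p.overall_score, -p.weight_lbs))

# The selected product is then handed to the LLM purely for phrasing, e.g.:
# prompt = f"Recommend {pick.name} ({pick.weight_lbs} lbs, score {pick.overall_score}) ..."
```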
2. Make it easy for LLMs to succeed
The LLMs most users are used to interacting with – via experiences like ChatGPT or Copilot – are designed to answer a vast array of questions pretty well most of the time. How worrying the “pretty well” and “most of the time” parts are depends on the question, how correct the answer needs to be, and the consequences of getting it wrong. This applies even more to agentic AI, where the goal is ideally a reliable automated process that actually does something for users.
In our conversational prototype, we care about answering a specific set of questions with very high accuracy. We’ve seen that this means building more deterministic systems, with components that infer user intent, route each question to the right set of tools, and then dynamically inject system prompts, few-shot examples and other instructions.
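Here is a hedged sketch of that routing idea. The intents, tools and prompts are illustrative placeholders rather than our production setup: a (stubbed) intent classifier picks a route, and each route carries its own system prompt and few-shot examples that are injected before the question ever reaches the LLM.

```python
# Illustrative sketch of intent routing with dynamic prompt injection.
ROUTES: dict[str, dict] = {
    "product_recommendation": {
        "system_prompt": "Answer only from the supplied ratings data. Name the product you recommend.",
        "few_shot": [("What's the best lightweight snowblower?",
                      "Based on lab tests, the top-rated lightweight model is ...")],
        "tool": "ratings_lookup",
    },
    "how_to": {
        "system_prompt": "Summarize the relevant article in plain language.",
        "few_shot": [],
        "tool": "article_search",
    },
}

def classify_intent(question: str) -> str:
    """Stub classifier; in practice this could be a small model or a dedicated LLM call."""
    return "product_recommendation" if "best" in question.lower() else "how_to"

def build_request(question: str) -> dict:
    """Assemble the tool choice and message list for the selected route."""
    route = ROUTES[classify_intent(question)]
    messages = [{"role": "system", "content": route["system_prompt"]}]
    for q, a in route["few_shot"]:
        messages += [{"role": "user", "content": q}, {"role": "assistant", "content": a}]
    messages.append({"role": "user", "content": question})
    return {"tool": route["tool"], "messages": messages}
```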
3. Be prepared to handle anything
We all saw the embarrassing news stories about major conversational AI product launches – these systems can’t reason about, and return acceptable answers for, every possible input.
Of course it’s impossible to plan for every single off-the-wall or harmful question. The long tail of user input and LLM output is unpredictable. However, tools like input guardrails can help bucket out-of-scope questions and return canned responses, so that inappropriate inputs never even reach the LLM.
Input guardrails aren’t just helpful for harmful or antagonistic questions. For example, in the prototype we’re building, we only want to answer questions about products that CR tests or otherwise has good intelligence on. A user may have a question about a certain product category – like binoculars or home video projectors – where we don’t have authoritative data to answer it. Rather than letting that question reach an LLM and risk a hallucinated answer, we employ input guardrails to catch the question and return a default response.
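A minimal sketch of such a guardrail is below. The category list and the keyword matching are stand-ins for illustration only; a real guardrail might use a classifier or embedding similarity instead, but the shape is the same: out-of-scope questions get a canned response and never reach the LLM.

```python
# Illustrative input guardrail: answer only within supported product categories.
SUPPORTED_CATEGORIES = {"snowblower", "lawn mower", "refrigerator", "dishwasher"}

DEFAULT_RESPONSE = (
    "Sorry, we don't have test results for that product category yet, "
    "so we can't give you a trusted recommendation."
)

def guardrail(question: str) -> str | None:
    """Return a canned response if the question is out of scope, else None to let it proceed."""
    q = question.lower()
    if not any(category in q for category in SUPPORTED_CATEGORIES):
        return DEFAULT_RESPONSE
    return None  # in scope: continue to routing and the LLM

# Usage: answer = guardrail(user_question) or run_llm_pipeline(user_question)
```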
4. Evaluate intentionally
Our prototyping process has benefited immensely from an “Evaluation Driven Development” approach. Evaluation Driven Development involves creating test data sets of questions and answers, running them through the LLM orchestration to collect outputs, and then evaluating those outputs using both automated and human-in-the-loop methods.
We started out using a grab bag of metrics for evaluation but quickly realized that many of them weren’t essential to what mattered for this particular project. For our prototype, we care about the correctness and style of responses, with particular interest in the correctness of intermediate steps (guardrails, router and retrieval). We’ve developed MVP ways of measuring these, and plan to make the measurements more robust and sophisticated over time.
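As a rough illustration of what an MVP version of this loop can look like, here is a small evaluation harness, sketched under assumptions: the `classify_intent` and `answer` arguments stand in for the real pipeline steps, and the hand-written test set and keyword checks are placeholders. Automated checks score the router and final answers; anything flagged would go to human review.

```python
# Illustrative Evaluation Driven Development loop with a tiny hand-written test set.
TEST_SET = [
    {"question": "What's the best lightweight snowblower?",
     "expected_intent": "product_recommendation",
     "must_mention": "snowblower"},
    {"question": "How do I store a snowblower for the summer?",
     "expected_intent": "how_to",
     "must_mention": "fuel"},
]

def evaluate(classify_intent, answer) -> dict:
    """Score intermediate routing and final answers against the test set."""
    router_hits, answer_hits = 0, 0
    for case in TEST_SET:
        if classify_intent(case["question"]) == case["expected_intent"]:
            router_hits += 1
        if case["must_mention"].lower() in answer(case["question"]).lower():
            answer_hits += 1
    n = len(TEST_SET)
    return {"router_accuracy": router_hits / n, "answer_coverage": answer_hits / n}
```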
There are many more lessons to come, and we’ll keep the community updated on our progress here. Stay tuned!